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The recently developed ribosome profiling technique (Ribo-Seq) allows mapping of the locations of translating ribosomes 
on mRNAs with subcodon precision. When ribosome protected fragments (RPFs) are aligned to rnRNA, a characteristic 
triplet periodicity pattern is revealed. We utilized the triplet periodicity of RPFs to develop a computational method for 
detecting transitions between reading frames that occur during programmed ribosomal frameshifting or in dual coding 
regions where the same nucleotide sequence codes for multiple proteins in different reading frames. Application of this 
method to ribosome profiling data obtained for human cells allowed us to detect several human genes where the same 
genomic segment is translated in more than one reading frame [from different transcripts as well as from the same rnRNA) 
and revealed the translation of hitherto unpredicted coding open reading frames. 



[Supplemental material is available for this article.] 

The human genome, containing slightly more than 20,000 protein 
coding genes (Clamp et al. 2007), generates a substantially more 
diverse proteome by encoding more than one protein in a single 
gene. The proteome is diversified through a number of molecular 
mechanisms that alter the sequence of the main gene product, 
such as alternative splicing (Matlin et al. 2005), RNA editing (Kiran 
and Baranov 2010; Wulff et al. 2011), utilization of alternative 
translation initiation sites (Ingolia et al. 2011; Ivanov et al. 2011), 
and post-translational modifications (Mann and Jensen 2003). 
However, in addition to modifications of existing protein se- 
quences, examples are known where the same genomic region 
codes for entirely different protein sequences. This occurs when it 
is decoded in alternative reading frames, a phenomenon known as 
dual coding. Dual coding hampers the evolutionary flexibility of 
nucleotide sequences (Firth and Brown 2006; Rancurel et al. 2009). 
Consequently, it is expected to be rare in genomes with weakly 
constrained size, and indeed, it is currently considered to be 
atypical. Nonetheless, comparative sequence analysis provides 
growing evidence that multiple instances of dual decoding do 
occur in humans (Liang and Landweber 2006; Chung et al. 2007; 
Ribrioux et al. 2008). Here we present a method that facilitates 
the detection of dual decoding instances in human using data 
obtained by the recently developed ribosome profiling tech- 
nique (Ingolia et al. 2009; Guo et al. 2010). 

Ribosome profiling is based on the isolation of rnRNA frag- 
ments protected by ribosomes followed by massively parallel se- 
quencing of cDNA libraries derived from the ribosome protected 
fragments (RPFs). The technique allows mapping the locations 
of translating ribosomes on the entire set of rnRNA molecules 
produced under given physiological conditions, thus providing 
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a unique opportunity to obtain quantitative genome-wide in- 
formation on protein synthesis (GWIPS) (Weiss and Atkins 201 1). 

This is important since protein abundance is mainly regulated 
at the level of protein biosynthesis (Schwanhausser et al. 2011). 
The area of GWIPS is rapidly growing. Since the publication of the 
technique in 2009 (Ingolia et al. 2009), an increasing number of 
studies have been carried out using the ribosome profiling tech- 
nique (Guo et al. 2010; Ingolia et al. 2011; Oh et al. 2011; Stadler 
and Fire 2011; Brar et al. 2012; Reid and Nicchitta 2012). 

When RPF sequences are aligned to rnRNA, a characteristic 
triplet periodicity can be observed for the locations of the 5' ends 
of the RPFs. Such triplet periodicity was observed in ribosome 
profiling experiments carried out in both yeast (Ingolia et al. 2009) 
and human cells (Guo et al. 2010). 

The triplet periodicity observed in human cells (HeLa) is il- 
lustrated in Figure 1A. This periodicity occurs because ribosomes 
move not by one, but by three nucleotides, one codon at a time. As 
a result when RPFs are aligned to rnRNA sequences, the majority of 
RPF 5' ends align at a specific distance from the first nucleotide of 
the A-site codon of the elongating ribosome. Allowing 15 nt for the 
distance from the decoding center to the 5' end of an RPF (Guo 
et al. 2010), the RPFs align predominantly to either the first or the 
third positions of the A-site codon, as can be seen in Figure 1 A. The 
second position has the lowest proportion of matching RPFs. Thus, 
the phase of the triplet periodicity can be used as a signature of one 
of the three potentially translated reading frames. Therefore, by 
analyzing the periodicity of aligned RPFs, it is possible to de- 
termine the frame that is being translated. 

From the aligned RPFs for each rnRNA transcript, the sub- 
codon profile can be generated to determine the translated reading 
frame. A schematic representation of how a subcodon profile is 
generated is given in Figure IB. Subcodon position 2 typically has 
the lowest number of RPFs, and this feature can be used as a sig- 
nature for detecting which out of the three reading frames is being 
translated. Moreover, this feature can be used for detecting shifts 
between reading frames, such as the one known to occur in the 
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Figure 1 . Utilization of triplet periodicity for detecting translated reading 
frames. (A) A plot of the number of RPFs aligning to particular mRNA 
positions between the 30th and the 47th nucleotide downstream from the 
start codon aggregated over 6000 human RefSeq mRNAs. In each codon, 
subcodon position 2 is shown as a red bar, while subcodon positions 1 and 
3 are shown as blue and green bars, respectively. (B) A schematic repre- 
sentation of the generation of a subcodon profile from the corresponding 
RPF profile. Each subcodon position (blue indicates 1; red, 2; green, 3) is 
shown on separated plots. (C) The absolute number of RPFs aligning to 
each subcodon position is shown for the coding region of human Anti- 
zyme 1 (OAZ1) mRNA. The location of the programmed ribosomal 
frameshift site is indicated by a broken black line. (D) The distribution of 
the number of RPFs aligning to different subcodon positions, upstream of 
the frameshift site (left) and downstream (right). It can be seen that the 
subcodon position with the lowest RPF count shifts from the second to the 
third upon ribosomal frameshifting. 



expression of human ornithine decarboxylase antizyme 1 (OAZ1) 
gene (Matsufuji et al. 1995). The OAZ1 mRNA subcodon profile is 
shown in Figure 1C. It can be seen that when RPF counts are sep- 
arated by their subcodon positions (phased relative to the start 
codon) (Fig. ID), there is a transition between the proportions of 
RPFs aligning to each position. While the second position has the 
lowest number of RPFs upstream of the frameshift site, it is the 
third position that has the lowest number of RPFs downstream 
from the frameshift site. This is consistent with the +1 direc- 
tionality of the ribosomal frameshifting (the second coding ORF 
in the OAZ1 mRNA is in the +1 frame relative to the first ORF). 

To find other mRNA sequences where reading phase transi- 
tions occur, we developed a computational approach for the 
analysis of subcodon profiles. This method exploits the subcodon 
RPF periodicity signature to identify mRNA transcripts with puta- 
tive reading frame transitions. Application of this method to 
a number of human mRNAs for which ribosome profiling data are 
available allowed us to detect dually coding regions of the human 
genome, where the same nucleotide sequence is used to encode 
protein sequences in more than one reading frame. 

Results 

Periodicity Transition Score (PTS) 

The most intuitive approach for determining the reading frame 
would be a sliding window to monitor the transition of the lowest 
proportion from one subcodon position to another. However, our 
empirical investigation of such an approach demonstrated that it is 
impractical for the type of data currently generated by the ribo- 
some profiling technique (see Supplemental Figs. SI, S2). This is 
largely due to the high nonuniformity of the RPF distribution. 
While certain coding locations of mRNAs have a large number of 
aligning RPFs, the majority of mRNA coordinates have no RPFs 
aligning to them. In the top 10 expressed genes (Fig. 2A, left), 
— 16% of CDS codons have no RPFs aligning to them. This in- 
creases to just over 63% when we expand the pool to the top 1000 
expressed genes (Fig. 2A, right). On the other hand, —24% of the 
top 10 expressed genes have CDS codons where over 100 RPFs 
align, while <2% of the top 1000 expressed genes have CDS codons 
where over 100 RPFs align. This heterogeneity may arise from 
biases introduced during the experimental protocol, oligonucleo- 
tide adapter ligations to the 3' and 5' ends of short reads for cDNA 
library preparation introduce biases that may result in the over- 
representation or underrepresentation of some RPFs (Hafner et al. 
201 1), but also very likely reflects authentic features of translation. 
Certain locations are translated significantly slower than other 
mRNA locations (Ingolia et al. 2011). For example, according to the 
ramp hypothesis (Tuller et al. 2010a), there is an evolutionary se- 
lection for slowly decoded codons at the beginning of coding re- 
gions, resulting in a relatively higher density of RPFs (Tuller et al. 
2010b). Other regions, where ribosomes move quickly, would have 
insufficient coverage. In addition, we have observed occasional 
single isolated RPF peaks in subcodon profiles for the second sub- 
codon position. Occunences of such peaks in regions with otherwise 
no RPF coverage result in false positives. Such peaks are not neces- 
sarily artifacts of the ribosome profiling method but may reflect 
authentic features of translation. In such locations, it is possible that 
the size of the region covered by the ribosome may differ from the 
average due, for example, to specific interactions with components 
of the ribosome inside the mRNA channel, leading to the generation 
of peaks that are not consistent with the average periodicity. 
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Figure 2. Computational approach for detecting transitions between reading frames and their performance on simulated dual coding. (A) Segments of 
pie charts represent the number of CDS codons with the specific number of RPFs aligning to them for the top 1 0 (left) and for the top 1 000 (right) most 
covered mRNAs from the Guo et al. (201 0) data set. It can be seen that, even for the most RPF-covered mRNAs, many CDS codons have no RPFs aligning to 
them. (B) Calculation of cumulative RPF subcodon proportion differences (CSCPD) upstream of and downstream from a sliding point x. Position a rep- 
resents the annotated CDS start, while position b denoTes the annotated CDS stop. Vertical lines represent RPFs that align at given CDS coordinates. (C) Principle 
of the automated scoring scheme, Periodicity Transition Score (PTS). PTS is calculated as the area (shaded in pink) where CSCPD over the examined CDS 
exceeds the expected level as estimated from the 95th quantile CSCPDs of the 1000 mRNA transcripts with the highest RPF coverage. For details, see 
Results. (D) Boxplots representing the distributions of PTS scores (/-axis) obtained for real ribosome profiles for mRNAs with artificially introduced 
frameshifts at different locations relative to the ends of CDS (x-axis). (£) Distribution of PTS for ribosome profiles on simulated mRNAs containing si- 
multaneously translated dual coding regions of different lengths. The simulations were carried out for three sets of mRNAs with different RPF density as 
indicated in the figure. The shaded areas represent the lower and upper quartile intervals for each RPF density. (F) Distribution of PTS for simulated mRNAs 
containing dual coding regions with varying densities of RPFs in the alternative frame. Shading is as in f. 



To overcome the problems of profile heterogeneity and 
local RPF length nonuniformity, we devised a different approach 
to assess whether a transition between frames exists in the ri- 
bosome profile of a particular mRNA. The approach is based on 
a sliding point where cumulative proportions of RPFs aligning 
with particular subcodon positions are calculated upstream 



of and downstream from this point as described below (see 
Fig. 2B). 

For a protein coding region between coordinate a and co- 
ordinate b, we can represent a ribosome profile as an array of the 
number of RPFs aligning their 5' ends to a particular position, for 
example, if w f a+1 , f a+2 . . . f b . 2 , f h -i, fb)- For each coordinate x 
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within the CDS, the proportion of RPFs corresponding to a partic- 
ular codon position is calculated for the upstream q u and the 
downstream q d regions as follows: 

x b 

1ft 1 ft 

<&(*)= -hr- and <&(*)= -r*r—> 
ZZf! I I fi 

j=l i=a j=l i=x+l 

where n indicates a position (1, 2, or 3) within a codon relative to 
the first nucleotide of the start codon. We define the cumulative 
subcodon proportion difference (CSCPD) as the absolute differ- 
ence between the upstream and downstream proportions: 

CSCPD n (x)=\q n u (x)-q n d (x)\. (2) 

The approach is advantageous in that it increases the size of 
the informative region while the effect of false signals generated by 
isolated RPFs is reduced. 

The statistical confidence of the CSCPD estimation is low 
when x is close to a or b due to the limited number of RPFs in either 
the upstream or the downstream region. To account for this, we 
computed the CSCPD curves for each of the 1000 mRNAs with the 
highest number of RPFs from the Guo et al. (2010) data set and 
used the 95th percentile for each subcodon position as a threshold 
over the length of the CDS. To address the differences in CDS 
lengths, the CDS coordinates of each mRNA were normalized into 
their relative positions within the CDS, where the length of the 
CDS is considered to be 1 (each CDS coordinate is divided by the 
total length of the coding region with the start taken as 0.0 and 
the stop as 1.0). Each CSCPD curve is evaluated at 100 equi- spaced 
normalized CDS positions between 0 and 1 using smoothing spline 
interpolation. A pointwise 95% confidence envelope for each 
subcodon position (C x , C 2 , C 3 ) was then obtained from the 95th 
percentiles of the 1000 CSCPDs at each normalized CDS position 
(see Fig. 2C; Supplemental Fig. S3). 

Under ideal conditions for detecting a frame transition (high 
coverage and uniform distribution of RPFs with the second subcodon 
position counts being higher than the first and the third subcodon 
position counts), there should be a local point at which the CSCPD 
reaches its maximum, and such a point should correspond to the 
location of the frame transition (see Supplemental Material section 
entitled Testing PTS on Simulated Dual Coding Sequences). 

Thus, we used the area of CSCPD excess over the 95th per- 
centiles for each subcodon position. The Periodicity Transition Score 
(PTS) is calculated as the sum of excess areas for each subcodon 
position (PTSi, PTS 2 , PTS 3 ) (Fig. 2C). An example of a PTS plot for 
an mRNA with a known case of programmed ribosomal frame- 
shifting (human Antizyme 1 mRNA) is shown in Supplemental 
Figure S4. 

To determine the threshold of PTS that can be used as an in- 
dicator of a frame transition in an mRNA, we calculated the PTS 
scores for a random 1000 mRNAs from the pool of 6000 most- 
covered genes (but outside of the pool of genes used for the 95th 
percentile calculations) and compared them with the PTS scores 
obtained for the same 1000 mRNAs, after introducing single nu- 
cleotide deletions to mimic translational frame transitions at dif- 
ferent locations in the mRNA. The results of these comparisons are 
shown in Figure 2D. It can be seen that before introducing an ar- 
tificial frameshift, the majority of mRNAs have a PTS below 10. 
Since it is expected that some of the 1000 mRNAs may have nat- 
urally occurring transitions, we decided to use a PTS of 10 as the 
threshold for selecting the candidates reported in this study. As can 



be seen from Figure 2D, when using a PTS threshold of 10, the 
potential false-negative rate is higher if a reading frame transition 
occurs closer to either end of the coding region than if the transi- 
tion occurs closer to the middle of the main reading frame. To es- 
timate the P- value for transcripts with a PTS score of 10 and higher, 
we permuted the RPF densities of the 1000 most highly expressed 
transcripts and generated 1 million artificial transcripts (for de- 
tails, see Supplemental Material). Transcripts with a PTS equal 
to or higher than 10, were considered as false positives. This 
yielded a P-value of 0.057. After removing cases where subcodon 
positions 1 and 3 contribute to a high PTS with no contribution 
from subcodon position 2 (see section Further Refinements of PTS), 
the re-estimated P-value for a PTS of 10 or higher, drops to 0.0084. 

The above simulations of dual coding regions addresses a 
simple case, where the transition between alternative frames oc- 
curs at a specific location and all ribosomes continuing translation 
shift their reading frame. Such a situation occurs in the case of ri- 
bosomal frameshifting in OAZ1 mRNA (Fig. 1C,D). However, with 
other examples of dual coding, certain sequence segments could be 
translated in two alternative frames. To explore how PTS performs 
on such cases of dual coding and how different features of dual 
coding affect PTS, we carried out additional simulations that are 
described in the Supplemental Material section Testing PTS on 
Simulated Dual Coding Sequences. Figure 2, E and F, shows how 
PTS depends on features of dual coding, such as the length of 
the overlapping region and the density (absolute and relative) 
of RPFs. 

Further refinements of PTS 

After the PTS had been computed for a set of mRNAs for which 
ribosome profiles are available, it appeared that the PTS performs 
well in predicting mRNAs with reading frame transitions. For exam- 
ple, two known cases of ribosomal frameshifting (OAZ1 [Matsufuji 
et al. 1995] and PEG10 [Shigemoto et al. 2001; Manktelow et al. 
2005]) had PTS scores among the highest 10. 

However, a large source of false-positive cases (from manual 
examination) was found in situations where the PTS is high due to 
mutual fluctuations in the proportions of RPFs corresponding to 
the first and the third subcodon positions, with the second posi- 
tion proportion unaffected. That is, PTS 2 , calculated for the second 
position alone, is low. In our experience, the profiles containing 
bona fide transitions (either known or artificially introduced) al- 
ways result in a high PTS 2 , along with an increase in either PTSi or 
PTS 3 . This is because the number of RPFs aligning to subcodon 
position 2 increases as a result of the transition in the reading 
frame. Correspondingly, the number of RPFs aligning to subcodon 
position 1 or 3, depending on whether the alternative frame is +1 
or -1, decreases. Therefore, in this work, all mRNA sequences with 
PTS 2 lower than both PTSi and PTS 3 were removed without further 
analysis. 

In addition, our empirical manual analysis revealed many 
false positives with less than 100-or-so RPF locations per mRNA. 
Therefore we decided to apply an additional filter — the minimal 
number of RPF locations required in subcodon position 2 — to re- 
duce the false-discovery rate. We found that 12 RPF locations or 
more in subcodon position 2 greatly reduced the inclusion of false 
positives. To reduce the effect of single high peaks in subcodon 
position 2 contributing to a high PTS, we removed the highest RPF 
peak in position 2 and recalculated PTS for each candidate. We 
then used this corrected PTS to score the candidates (see Supple- 
mental Table SI). 



2222 Genome Research 

www.genome.org 



Detection of dual decoding with ribosome profiling 



Yet further manual examination of profiles revealed that a 
high PTS can occur for an mRNA transcript whose profiles are in- 
consistent with the behavior expected in the case of a frame 
transition. The most prominent example of how a high PTS could 
be generated for an mRNA without a transition is the existence of a 
single isolated peak of RPFs corresponding to a second position 
within a codon. As discussed earlier, such peaks do not necessarily 
reflect fluctuations in the noise of the technique but could be due 
to a systematic alteration of RPF length in a local region in a se- 
quence-dependent manner. However, irrespective of the origin of 
such peaks, they significantly contribute to the PTS and generate a 
large number of false positives in our analysis. 

Dual coding genomic sequences 

Manual evaluation of subcodon profiles allowed us to categorize 
these candidates into six groups as outlined in Figure 3A. The 
functional categories illustrated in Figure 3B include (1) instances 
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Figure 3. Classification of dual coding regions. (A) Classification and 
PTS of 108 candidates. (B) Schematic organization of three major classes 
of dual coding. pORFs are shown as light blue bars and alternative 
frames as light pink bars. Splicing organization: green bars correspond 
to exons included in transcript variants, and lines indicate intronic re- 
gions excised during splicing. 



where dual coding occurs due to overlaps between regulatory 
upstream ORFs (uORFs) and main protein coding ORFs (pORFs); 

(2) overlaps between pORFs and nonupstream ORFs (nORFs); and 

(3) transcript variants generated as a result of alternative tran- 
scription initiation or alternative splicing. 

The second class of dual coding is the most surprising, as 
translation of uORFs and dual coding due to alternative splicing 
has been documented previously. Fifteen mRNAs in our set were 
classified as containing non-upstream protein coding ORFs (nORFs) 
(see Fig. 3 A; Supplemental Table SI). The subcodon profile of the 
top-scoring nORF, neuronal transcription factor NPAS2 (RefSeq 
mRNA NM_002518), is shown in Figure 4 A (left panel) and Sup- 
plemental Figure S15. This candidate has RPFs aligning in an al- 
ternative ORF that is located close to the 5' UTR. Comparative 
analysis of the genomic sequences revealed absence of stop codons 
in the alternative ORF in 22 of the 23 available vertebrate NPAS2 
orthologs (see alignments in Fig. 4C). The exact sequence in 
the vicinity of the predicted start codon is CTAArGG ATG AAG 
ATGAGAA (where ATG codons are shown in bold, the predicted 
pORF start codon is also in italics, and the alternative frame start 
codons are underlined; for simplicity and consistency, we use T 
to denote both uridines and thymidines here and elsewhere). 
It is plausible that start codons in such close proximity to each 
other compete for initiation (Matsuda and Dreher 2006), and 
therefore, the role of the alternative ORF may be regulatory. It 
would be very interesting to investigate a potential relationship 
between such regulation and a function of NPAS2 as a part of a 
molecular clock in the human brain (Reick et al. 2001). A some- 
what similar situation of competing initiator ATG codons is ob- 
served in initiation factor EIF4E2 mRNA (Refseq NM_004846) (see 
Supplemental Fig. SI 7). Such a competition could be regulated 
by changes in the stringency of start codon selection, which has 
been shown to be mediated by EIF1 and EIF5 factors (Loughran 
et al. 2011). 

Among all nORF candidates, about one-half are situated en- 
tirely within the corresponding pORF (nested nORFs), while the 
other half extend into the 3' UTRs. 

The largest class of dual coding genomic sequences (29 in- 
stances) corresponds to regulatory uORFs overlapping pORFs. The 
profile of the highest scoring uORF candidate, transcription sup- 
pressor THAP7 mRNA, is shown in Figure 4A (right panel) and 
Supplemental Figure S50. A significantly higher density of RPFs is 
observed in the region of the uORF that overlaps the pORF. In- 
terestingly, the highest peak of RPF density is situated near the stop 
codon of the uORF. Perhaps ribosomes stall at the end of this uORF 
in a manner similar to the well-established ribosome stalling me- 
diated by the MAGDIS peptide encoded by the uORF in S-adeno- 
sylmethionine decarboxylase (AMD!) (Hill and Morris 1993) or 
by a specific mRNA-binding protein as in the regulation of the 
MSL2 mRNA by Sex lethal (Medenbach et al. 2011). Comparative 
sequence analysis of available THAP7 orthologs from the genomes 
of 19 vertebrates (Fig. 4C, right) suggests that the amino acid se- 
quence of the THAP7 uORF evolved faster than the protein se- 
quence encoded in the same region by the pORF frame. However, 
none of the sequences from other vertebrates contain stop codons 
within the region corresponding to the THAP7 uORF. Moreover, 
the position of the uORF stop codon is almost universally con- 
served among the analyzed orthologs. This points to the evolu- 
tionary significance of this uORF and suggests that the significance 
of its translation may be mainly regulatory rather than for the 
production of a functional protein product. It also highlights the 
limitations of dual coding detection by comparative sequence 
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Figure 4. Dual coding in NPAS2 mRNA due to the presence of a translated nonupstream ORF and in THAP7 mRNA due to the overlap of the main ORF 
with an uORF. (A) Subcodon profile (top three rows) and mRNA-seq (fourth row) for NPAS2 mRNA (left; NM_002518) and THAP7 mRNA (right; 
NM_001 008695). CDS coordinates are marked with dotted vertical lines. (B) ORF organization of NPAS2 mRNA (left) and THAP7 mRNA (right). The three 
reading frames are indicated as 1 , 2, 3. Blue vertical lines indicate stop codons and start codons are indicated in red. Annotated CDS is shaded in light blue. 
The areas where translation in alternative frames is detected are shaded in light pink. (C) Comparative analysis of orthologous genomic sequences from 23 
vertebrate species for NPAS2 (left) and from 1 9 vertebrate species for THAP7 (right). Colored bars represent codon substitutions within multiple sequence 
alignments for the standard (top) and alternative (bottom) reading frames (detailed alignments are in Supplemental Figs. SI 21, SI 22). Dark green and 
light green boxes correspond to synonymous and positive (in the BLOSUM62 matrix) substitutions, respectively; red boxes correspond to negative (in 
BLOSUM62 matrix) nonsynonymous substitutions. Gaps are shown in yellow and stop codons are in black. Stop codons are also aggregated across the 
entire alignment beneath each bar. Plots of coding likelihood are shown underneath the colored bars for both reading frames as calculated with MLOGD. 
Synonymous position conservation for the standard translation phase (pORF) is shown above the colored bar. (D) Exon organization of the NPAS2 locus 
(left) and the THAP7 locus (right). CCDS and RefSeq gene tracks from the UCSC Genome Browser are shown in green and blue bars, respectively. 
Alternatively decoded regions are indicated in red. 



Alternative 
translation 



analysis, since alternatively translated regions do not necessarily 
evolve under the same evolutionary constraints as protein coding 
regions. 



In nearly half of the detected translated uORFs, we failed to 
find suitable ATG codons for initiation of uORF translation. This 
could be either due to non-ATG initiation (Ingolia et al. 2009, 
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2011; Ivanov et al. 2011), incompleteness of the corresponding 
RefSeq mRNA at the 5' end, or differences among alternative splice 
variants in the 5' UTR. 

Another source of dual coding in the human genome is al- 
ternative splicing (Liang and Landweber 2006). Some transcript 
variants contain sequences originated from the same genomic loci, 
but in different translational phases relative to the initiation co- 
don. An established case where the same exon is translated in two 
alternative frames is the CDKN2A (also known as INK4a) (Quelle 
et al. 1995) gene. Among our candidates, we have identified 16 
instances of dual coding that can be attributed to alternative 
splicing events or to initiation of transcription at alternative starts 
(see Supplemental Table SI). The top-scoring candidate among 
identified cases is Cllorf48 (Fig. 5; Supplemental Fig. S67). The 
majority of RPFs for Cllorf48 mRNA, are located at the end of 
the predicted RefSeq pORF and extend into the 3' UTR (Fig. 5 A). 
Examination of mRNA-seq reads for Cllorf48 revealed that mRNA- 
seq density is increased in the area of the alternatively decoded 
region (Fig. 5 A, pink area). This indicates that the RPFs are likely to 
originate from the translation of additional transcripts whose se- 
quences are not included in the RefSeq database. Indeed, such a 
transcript exists among Ensembl transcripts (accession number 
ENST00000524958). The subcodon profile that has been generated 
for ENST00000524958 and the distribution of RPFs is consistent 
with the CDS predicted for that transcript (Fig. 5 A, right). Addi- 
tional independent evidence that the area of high RPF density 
encodes a protein product in an alternative frame corresponding to 
ENST00000524958 transcript is provided by evolutionary analysis. 
The multiple alignment of genomic sequences corresponding to 
the Cllorf48 orthologs from 15 vertebrate species is shown in 
Figure 5C. It can be seen that codon substitutions in the area with 
high RPF density are consistent with purifying selection acting 
on ENST00000524958 CDS, which is in the +1 frame relative to 
RefSeq CDS. Also, it can be seen that conservation of synonymous 
positions in pORF codons (0 frame) are markedly elevated for the 
region corresponding to high RPF density. Strikingly, it can also be 
seen that conservation, positive coding likelihood, and a lack of 
stop codons in the +1 frame are observed only for the short region 
with high RPF coverage and not for the full ORE Moreover, Oyama 
et al. (2007) has detected expression of this alternative protein using 
mass spectrometry. Thus, the Cllorf48 locus is an example of a sit- 
uation where the same genomic sequence is simultaneously 
translated in different frames in two alternative transcripts that 
coexist in HeLa cells. 

The situation where two alternative transcripts coexist and 
are translated at the same time is not always the case. We also 
found situations where only one transcript is present in the cell 
under the given conditions. Such an example is PHPT1 mRNA 
(sixth top in Supplemental Table SI), which is illustrated in Figure 
6 and Supplemental Figure S60. The PHPT1 gene contains four exons. 
Two mRNA transcripts are known for this gene: NM_001 135861 
contains all four exons and encodes isoform 2 of PHPT1, while 
NM_014172 lacks the third exon and encodes isoform 3 oiPHPTl. 
As a result of exon skipping, the 3 '-terminal exon is positioned in 
different reading frames relative to the initiation codon in these 
two transcripts (Fig. 6D). The ribosome profile was initially built for 
the transcript with the longest isoform (see Methods). However it 
produced a high dual coding score because RPFs at the 3 ' end of the 
CDS originate from the alternative transcript where this region is 
in a different frame. The analysis of RNA-seq fragments (Fig. 6A) 
shows the lack of fragments conesponding to the skipped exon, thus 
suggesting that only the short transcript is expressed in HeLa cells. 



Although we were able to identify many dual coding regions, 
a number of mRNAs with a high PTS are false positives. Ribosome 
profiles of about a third of all candidates produced high scores for 
reasons other than dual translation. The most prominent example 
(fourth top in Supplemental Table SI) is a profile for the dystrophin 
DMD mRNA (RefSeq NM_004010). The subcodon profile for this 
mRNA (Supplemental Fig. S90) is inconsistent with dual decoding 
and scored highly due to the limitations of our computational 
technique (for details, see Supplemental Discussion). In addition 
to the 33 false-positive candidates, 13 candidates have subcodon 
profiles that suggest dual coding, but dual coding cannot be ex- 
plained by their ORF organization. These unexplained cases are 
discussed in the Supplemental Discussion. 

Discussion 

Our work demonstrates the applicability of the ribosome profiling 
technique for the detection of translated reading frames in human 
mRNAs. This allowed us to identify a number of genomic loci that 
are being translated in more than one frame. An immediate simple 
question raised by this study is how many dually decoded regions 
are in the human genome. A primitive extrapolation of the num- 
ber of cases identified among 6000 genes would indicate —1%. 
However, this is clearly an underestimate for the following reasons. 
First, RPF coverage for the majority of analyzed mRNAs is lower 
than what is required for detecting such regions. Second, the 
method allows dually coded regions to be detected only if the al- 
ternative frame has RPF coverage comparable to, or higher than, 
that of the standard frame (see Fig. 2E,F; Testing PTS on Simulated 
Dual Coding Sequences in the Supplemental Material). It is rea- 
sonable to expect that there are many cases where an alternative 
frame is translated less efficiently than the standard one. More 
sensitive statistical techniques coupled with deeper ribosome 
profiling are needed for the detection of such cases. Third, ri- 
bosome profiling experiments were carried out under particular 
conditions. Dual decoding is likely to be regulated. Therefore, only 
the standard frame may be translated under particular experi- 
mental conditions. Finally, it is likely that dual coding is more 
prevalent in low expressed genes, since highly expressed genes are 
optimized for efficient translation and their coding sequences are 
too restrained to accommodate additional coding information. 
The pool of genes analyzed in this study, however, is limited to 
highly expressed genes. 

This is supported by comparison of our list of dual coding 
candidates with sets of genes that have been predicted as dual 
coding in previous studies. We have been able to identify reading 
frame switches in two (OAZ1 and PEG10) out of six known cases of 
programmed ribosomal frameshifting in humans (Bekaert et al. 
2010), three of which did not have sufficient coverage by RPFs. Our 
method did not detect the well-established examples of dual 
coding: GNAS, XBP1 (Calfon et al. 2002), and CDKN2A (also 
known as INK4a) (Quelle et al. 1995). In the case of GNAS 
(Nekrutenko et al. 2005), the dual coding isoform was not ex- 
pressed under the conditions of Guo et al. (2010). The XBP1 gene 
produced a high PTS, but failed to pass our additional filters (see 
Methods). The longest isoform of CDKN2A, which encodes the 
16INK4a protein, had a low PTS and consequently did not appear 
in our final list of candidates. Among the 40 genes predicted in the 
Chung et al. (2007) study, transcripts from 12 are part of the 6000 
pool that we have used. Three of these ARF-containing genes 
(DNMT3A, BBX, and RBAK) have a PTS > 10 but were removed 
from our final list of 108 genes by subsequent filters explained 
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Figure 5. Dual coding in CI lorf48 locus. (A) Subcodon profile and mRNA-seq for RefSeq mRNA NM_024099 (left) and predicted Ensembl transcript 
ENST00000524958 (right). (B) ORF organization of NM_024099 mRNA (left) and ENST00000524958 (right). (C) Comparative sequence analysis of 
corresponding genomic alignments from 1 5 vertebrate species for RefSeq mRNA NM_024099. (D) Exon organization of the CI 1orf48 locus. For detailed 
description, see legend to Figure 4. The higher density of mRNA-seq reads for NM_024099 (fourth row panel A, left) in the shaded pink area indicates that 
RNA-seq reads are being generated from an additional transcript variant corresponding to Ensembl transcript ENST00000524958. In panel C, it can be 
seen that for most of the predicted CDS, codon substitutions are consistent with RefSeq CDS predictions (the area is greener in the zero-frame). However, 
for the pink shaded area, substitutions are consistent with protein coding evolutionary signatures in the +1 frame. It can be seen that the coding likelihood 
for the +1 frame exceeds the threshold in the area of dual decoding. The conservation plot of synonymous codon positions, shown above the 0 frame, 
shows that conservation of synonymous positions is significantly higher in the shaded pink area. This is consistent with the purifying selection acting on 
protein coding sequences in two frames in this region. 



in Methods. Comparative sequence analysis of 29 mammalian 
genomes revealed 19 (12 sense and seven anti-sense) novel dual 
coding gene candidates, of which six (sense) were among the 6000 
transcripts to which our method was applied (Lin et al. 201 1). One 
of them, UBE2E2 (NM_1 52653), was identified as dual coding in 



our study. The discrepancies between our data set and previous 
predictions do not invalidate either predictions or our approach. It 
is possible that the translation of an alternative frame in previously 
described candidates does not occur in HeLa cells or occurs at a rate 
that is insufficient to be detected by our method. 
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Figure 6. Dual coding in alternatively spliced PHPT1 exon. (A) Subcodon profile and mRNA-seq for PHPT1 mRNA variant NM_001 1 35861 (left) and 
variant NM_014172 (right). (B) ORF organization for NM_001 135861 (left) and NM_014172 (right). (C) Analysis of codon substitutions within the 
multiple alignments of orthologous genomic sequences for NM_001 1 35861 . (D) Exon organization of the two PHPT1 mRNA variants. For notations, see 
legend to Figure 4. Subcodon profiles for variant NM_001 1 35861 (panel A, left), which is the longest isoform (see Methods), indicate that while the 
translated frame is the same as the CDS for most of the CDS region (low RPFs density for the second [red] position), the sequence is translated in the +1 
frame relative to the CDS frame at its end and downstream (pink shaded area). In addition, there is an evident gap in translation in the subcodon profile and 
mRNA-seq just prior to the pink shaded area, which corresponds to the third exon in PHPT1 mRNA variant NM_001 1 35861 (panel D). As a result, the 
fourth exon in the NM_001 1 35861 mRNA is in an alternative frame relative to the CDS start codon. Codon substitution analysis of multiple sequence 
alignments (panel Q is consistent with the dual decoding of the 5' end of the fourth exon. Synonymous and positive nonsynonymous substitutions are 
predominant in both the zero and +1 frames in the locations where RPFs are found. 



The present study demonstrates that the coding and 
translational landscape of the human genome is more sophis- 
ticated than previously appreciated. Further development of 
high-throughput approaches for studying translation, combined 



with the growing power of comparative sequence analysis, 
provides an opportunity for obtaining quantitative information 
on the versatility of decoding and translation at the whole-cell 
level. 
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Methods 

Analysis of 6000 human mRNAs 

The predictive power of PTS depends on the number of RPFs that 
can be aligned to an mRNA sequence (see Supplemental section 
Testing PTS on Simulated Dual Coding Sequences). Therefore, we 
restricted our analysis to a limited set of mRNAs for which a com- 
paratively high number of RPFs are available. We used the 6000 
mRNA sequences with the highest RPF coverage from the data set 
that has been reported by Guo et al. (2010). We found that RPF 
coverage lower than that of the top 6000 is insufficient for statis- 
tically reliable determination of reading frames. Over 800 mRNA 
profiles scored above the PTS threshold. One hundred eight of 
these passed the filters described in the Results section. This set 
of mRNA profiles was analyzed manually using additional infor- 
mation in order to explain the predicted frame transitions for each 
case. The analysis involved manual examination of the ORF or- 
ganization and examination of the subcodon profiles for the entire 
mRNA (as opposed to the analysis of just the previously annotated 
CDS region for the calculation of the PTS) in conjunction with 
"naked mRNA" profiles (RNA-seq). The analysis of the entire 
mRNA profile is required for those cases where an alternatively 
translated region corresponds to an ORF that overlaps the 5' or 3' 
end of the previously predicted CDS. In such situations, we expect 
RPFs to also occur outside the CDS region. Such a distribution can 
then be used as additional evidence that the high PTS results from 
dual translation. We also expect RPF subcodon proportions to be in 
accordance with the reading frame of the new ORF in the region 
in which it does not overlap with the main CDS. In addition, for 
a subset of cases, we examined multiple sequence alignments of 
corresponding genomic regions. The regions that are translated in 
alternative frames are expected to evolve under purifying selection 
with the ratio of nonsynonymous to synonymous substitutions 
being significantly below 1 for the amino acid sequences encoded 
in alternative frames. Also the regions of dual coding are expected 
to have elevated conservation at synonymous codon positions, since 
synonymous positions in one reading frame are nonsynonymous 
in the other. The details of the 108 mRNAs analyzed are given in 
Supplemental Table SI. For each mRNA sequence, we provide in- 
dividual subcodon profiles and plots of ORF organization (Sup- 
plemental Figs. S9-S116). 

There are several mechanisms that can be responsible for the 
dual decoding of the genomic regions identified in this work. The 
method described does not allow discrimination between these 
mechanisms. Our classification of individual cases into mecha- 
nistic categories is based on external information that we have 
obtained from public bioinformatics resources, and therefore, the 
validity of our predictions relies on what is available in those re- 
sources. For example, for a single mRNA variant we may observe 
the predicted CDS to be overlapped by a translated upstream ORF, 
and therefore, we would classify such a case as the translation of 
a uORF overlapping the main CDS. However, we cannot exclude 
the possibility that the ribosome profiles were derived from a dif- 
ferent unreported splicing variant where the regions involved are 
joined together in such a way that the start codon of the uORF 
appears in frame with the previously predicted CDS, and therefore, 
all RPFs have been generated from a single ORF. 



Generation of individual mRNA ribosome profiles 

Short sequence reads (corresponding to RPFs) generated during 
ribosome profiling experiments in HeLa cells (Guo et al. 2010) were 
obtained from the NCBI Gene Expression Omnibus (accession 
GSE22004) (Edgar et al. 2002). The mRNA sequences for the top 



6000 genes from quantification files were downloaded from the 
NCBI Refseq database (Pruitt et al. 2009) in fasta format in January 
2011. Guo et al. (2010) quantification files comprise single RefSeq 
mRNA references for each gene where, for which genes with 
multiple isoforms, the longest isoform is chosen. To maximize the 
total number of RPFs for each gene, short reads from all available 
experiments in HeLa cells (SRR057511, SRR057512, SRR057516, 
SRR057517, SRR057521, SRR057522, SRR057526, SRR057529, 
SRR057532) were aggregated. The aggregated RPFs were then 
aligned, using the Bowtie short read aligner software package 
(Langmead et al. 2009). A seed region of the first 25 nt at the 5' end 
was used according to the method described by Guo et al. (2010). 
However, we allowed zero mismatches in the seed region. 

Individual mRNA subcodon profiles and ORF plots were then 
generated using custom Python and R scripts and the Biostrings 
package from the Bioconductor library (Gentleman et al. 2004). 
It has been shown previously that the distance between the 5' 
end of an RPF and the position of the anticodon in the ribosomal 
A-site is —15 nt (Guo et al. 2010). Therefore, to generate subcodon 
profiles, each RPF was assigned to the mRNA coordinate corre- 
sponding to the 15th RPF nucleotide from the 5' end. Thus sub- 
codon profiles represent the locations of the A-sites of the trans- 
lating ribosomes. 

The CSCPD and PTS were computed using custom scripts in R 
according to the algorithms described in the Results section. 

Comparative sequence analysis 

MULTIZ (Blanchette et al. 2004) multiple alignments for vertebrate 
species were obtained from the UCSC Genome Browser (Fujita 
et al. 2011) and visualized with the aid of a cgi script (kindly pro- 
vided by Mike Lin, CSAIL, MIT) and additionally processed with 
custom R scripts. Sequences containing long consecutive gaps 
(>50 codons) were removed prior to the analysis. The coding 
likelihood for annotated CDS frames and alternative frames was 
quantified using MLOGD according to the method described 
previously (Firth and Brown 2006). Conservation at synonymous 
codon positions in annotated CDSs was computed according to 
the method described previously (Firth and Atkins 2009). Full 
MLOGD and synonymous substitution conservation plots for the 
examples described in the Results section of the manuscript are 
shown in Supplemental Figures SI 17 through S120. 

Data access 

The R scripts for computing the CSCPD and PTS are provided at the 
end of the Supplemental Material and are also available on http:// 
lapti.ucc.ie/bicoding. 
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